Inflating Training Data for Statistical Machine Translation using Unaligned Monolingual Data

Authors

Wei YANG

Zhongwen ZHAO

Yves LEPAGE

Abstract

In data-driven machine translation, parallel corpora are an extremely important resource. For language pairs that involve English, many bilingual or multilingual parallel corpora are freely available, especially for European languages. Improving translation quality for less-resourced language pairs, such as Chinese–Japanese, requires ever larger amounts of aligned training data, but building large bilingual corpora is not easy for such less-documented language pairs. In this paper, we show how to construct a Chinese–Japanese quasi-parallel corpus automatically by using analogical associations based on a small number of parallel sentences and a reasonable amount of monolingual data. We perform SMT experiments in Chinese–Japanese and compare a baseline system with a system built by adding the quasi-parallel corpus. On the same test set, translation quality improved significantly over the baseline system.

Related resources

Forms Wanted: Training SMT on Monolingual Data

We propose and evaluate a simple technique of “reverse self-training” for statistical machine translation. The technique makes it possible to extend the target-side vocabulary of the MT system using target-side monolingual data, and it is especially aimed at translation into morphologically rich languages.

Improving Neural Machine Translation Models with Monolingual Data

Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for neural machine translation (NMT). In contrast to previous work, which integrates a sepa...

Improved Statistical Machine Translation Using Monolingual Paraphrases

We propose a novel monolingual sentence paraphrasing method for augmenting the training data for statistical machine translation systems “for free” – by creating it from data that is already available rather than having to create more aligned data. Starting with a syntactic tree, we recursively generate new sentence variants where noun compounds are paraphrased using suitable prepositions, and ...

Improving word alignment for low resource languages using English monolingual SRL

We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focu...

Cross-lingual spoken language understanding from unaligned data using discriminative classification models and machine translation

This paper investigates several approaches to bootstrapping a new spoken language understanding (SLU) component in a target language given a large dataset of semantically-annotated utterances in some other source language. The aim is to reduce the cost associated with porting a spoken dialogue system from one language to another by minimising the amount of data required in the target language. ...

Publication date: 2015
